
Enforce READ COMMITTED isolation when using mysql #15714

Merged

Conversation

@SamWheating (Contributor) commented May 7, 2021

Closes: #15559

In the investigation in the above issue, we found that MySQL's default isolation level ended up causing some really weird issues when running multiple schedulers.

This ended up causing skipped tasks while backfilling and caused DAGs using depends_on_past to completely deadlock.

By enforcing READ COMMITTED when using MySQL we should be able to avoid these issues entirely. I don't expect that this will introduce any issues when using MySQL 5.x and/or a single scheduler, as READ COMMITTED is the default isolation level used by PostgreSQL.

I haven't had a chance to actually test this implementation, as we fixed the errors in our production Airflow environment by updating the isolation level on the database itself. If you'd like, I can set up a default MySQL DB, replicate the issue, and ensure that this fix works as expected.
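
For readers following along, a minimal sketch of what this kind of enforcement looks like with SQLAlchemy (the variable names and config plumbing here are illustrative, not the actual Airflow patch):

```python
# Illustrative sketch only -- not the actual Airflow code. SQLAlchemy's
# create_engine() accepts an isolation_level argument that the MySQL
# dialects translate into SET SESSION TRANSACTION ISOLATION LEVEL.
from sqlalchemy import create_engine

SQL_ALCHEMY_CONN = "mysql+mysqldb://user:pass@localhost/airflow"  # example DSN

engine_args = {}
if SQL_ALCHEMY_CONN.startswith("mysql"):
    # InnoDB defaults to REPEATABLE READ, which takes gap locks on
    # range queries with FOR UPDATE; READ COMMITTED locks only the
    # matching rows, matching PostgreSQL's default behaviour.
    engine_args["isolation_level"] = "READ COMMITTED"

engine = create_engine(SQL_ALCHEMY_CONN, **engine_args)
```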


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

boring-cyborg bot added the "area:Scheduler including HA (high availability) scheduler" label on May 7, 2021
@ashb (Member) commented May 7, 2021

@potiuk Possibly one more for 2.0.3

@potiuk (Member) commented May 7, 2021

> @potiuk Possibly one more for 2.0.3

Yep. I see.

@ashb (Member) commented May 7, 2021

Could you split the dialect change to a separate PR please?

github-actions bot added the "full tests needed (We need to run full set of tests for this PR to merge)" label on May 7, 2021
github-actions bot commented May 7, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

github-actions bot commented May 7, 2021

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

SamWheating force-pushed the sw-force-read-committed-with-mysql branch from 965b08f to 79c0ff9 on May 7, 2021 14:22
@ashb (Member) commented May 7, 2021

Looks like we test the engine args @SamWheating -- could you fix the tests?

@potiuk (Member) commented May 7, 2021

Yep. I'd love to merge it and include in 2.0.3

@SamWheating (Contributor, Author) commented May 7, 2021

Tests should be fixed now 🤞; they actually surfaced an indentation error that caused the isolation level to remain unset if connection pooling was disabled.
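
To illustrate the class of bug described here (a hypothetical reconstruction, not the actual diff): if the isolation setting sits inside the pooling branch, disabling the pool silently skips it.

```python
# Hypothetical reconstruction of the bug class described above -- not
# the actual Airflow diff.
from sqlalchemy.pool import NullPool

pool_enabled = False  # illustrative flag, not the real config name
engine_args = {}

if pool_enabled:
    engine_args["pool_size"] = 5
    # BUG: indented one level too deep, so the isolation level was
    # only set when pooling was enabled.
    engine_args["isolation_level"] = "READ COMMITTED"
else:
    engine_args["poolclass"] = NullPool

# FIX: dedent the assignment so it applies regardless of pooling.
engine_args["isolation_level"] = "READ COMMITTED"
```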

@potiuk (Member) commented May 7, 2021

IT WORKS!

@potiuk (Member) commented May 7, 2021

Still, one more test to fix :(

github-actions bot commented May 7, 2021

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


@SamWheating (Contributor, Author) commented May 8, 2021

😩 my bad, I forgot to run the tests locally with all backends. Should be good to go now 🤞

Update: Got some other CI failures now, but I don't think they're related to my changes, I'll rebase on a fresh master and see if it fixes anything 🤔

SamWheating force-pushed the sw-force-read-committed-with-mysql branch from 3e9db43 to ef6e45d on May 8, 2021 17:48
@potiuk (Member) commented May 8, 2021

> Update: Got some other CI failures now, but I don't think they're related to my changes, I'll rebase on a fresh master and see if it fixes anything

It should. We had a few problems with failing master due to too hasty merges and some selective checks slipping through the cracks, but this is all solved now.

@potiuk (Member) commented May 9, 2021

Random failure. Merging :)

potiuk added this to the Airflow 2.0.3 milestone on May 9, 2021
potiuk merged commit 231d104 into apache:master on May 9, 2021
potiuk pushed a commit that referenced this pull request May 9, 2021
* Enforce READ COMMITTED isolation when using mysql

* Fixing up tests, removing indentation

* Fixing test

(cherry picked from commit 231d104)
@alippai commented May 14, 2021

I'm a little confused about this change. If decreasing the isolation level helps, then necessary COMMITs are being missed. If anything, raising the isolation level to SERIALIZABLE should be the solution. Were these alternatives considered? Airflow is a scheduler; dropping transaction safety in response to concurrency issues sounds like a super-weird solution. Are we sure about this?

@potiuk (Member) commented May 16, 2021

I think the change is needed mostly because of the locking behavior in those different isolation levels.

We are using row-level locking in MySQL 8 with the HA scheduler, and if you look at the behaviour of the default REPEATABLE READ: when search criteria are used while locking rows, MySQL also locks the gaps between the matching rows, not only the rows themselves, which likely causes the deadlocks mentioned by @SamWheating.

In the case of READ COMMITTED, the gaps between rows are never locked. Inserts could create phantom rows in this case, but that is really not an issue in the way we use the locks (they are read/update locks, not inserts). Even if more rows are initially locked while the WHERE clause is evaluated, once it has been evaluated and rows are selected for locking, the locks on the rows that were not selected are freed.

SERIALIZABLE behaves more like REPEATABLE READ, but it converts plain SELECTs into SELECT ... FOR SHARE, so those are "read allowed / write locked". I don't think this solves the problem though, because with HA schedulers we have several schedulers trying to run SELECT ... FOR UPDATE on the same table with some selection criteria, which means deadlocks are still possible.

As far as I see the change is sound :)

More information here: https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
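
To make the gap-locking point concrete, here is a sketch of the kind of range-locking query the discussion is about and how the enforced isolation level interacts with it (the DSN, table, and column names are illustrative, not Airflow's actual SQL):

```python
# Two sessions (think: two HA schedulers) each lock a range of rows.
# Under InnoDB's default REPEATABLE READ, the index gaps scanned by the
# WHERE clause are locked too, so sessions can deadlock even when
# SKIP LOCKED hands them disjoint rows; under READ COMMITTED only the
# returned rows stay locked.
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+mysqldb://user:pass@localhost/airflow",  # example DSN
    isolation_level="READ COMMITTED",  # the behaviour this PR enforces
)

with engine.begin() as conn:
    rows = conn.execute(
        text(
            "SELECT id FROM dag_run WHERE state = 'queued' "
            "ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED"
        )
    ).fetchall()
    # ... update the claimed rows; the commit at block exit releases the locks.
```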

@SamWheating @ashb - did I get this correctly?

@alippai - does it sound convincing?

@SamWheating (Contributor, Author) commented

Hey @alippai, thanks for bringing this up. I am in agreement with Jarek's comment and don't expect this to introduce any new issues. However if there's something I've missed then I am happy to reevaluate these changes.

In general, if you all feel that this change is risky or too opinionated (as we're potentially overriding people's MySQL configuration without making it super clear), we could revert and approach this problem differently, either by:

  • Mentioning this in the docs and telling people to set this isolation at their database level.
  • Moving the isolation_level out to a config option, similar to the connection pooling parameters.

Thoughts?

@alippai commented May 16, 2021

I think what you wrote is correct, but this change is a little bit different. What you describe would be achieved by starting a new transaction with the selected isolation level, using exclusive locks, and submitting COMMITs between them, or by killing one of the transactions in the deadlocked HA case (if that's really a concurrency issue).

The change in this PR affects all of Airflow's operations. It might be a working workaround if the transactions are not committed / rolled back in a timely manner, but I'm not sure that avoiding deadlocks through undocumented, implicit assumptions about how the different parties use all the tables is the right approach. Explicit exclusive locks on the problematic transactions tend to be more maintainable (unless we see major performance degradation).

The issue above doesn't happen with INSERTs; however, I'd assume INSERTs are happening to the same table from different scheduler instances (or the UI / API), and this change may have a side-effect for those.

@alippai commented May 16, 2021

I agree that this is still a valid solution, and opinionated is good (you know the usage and I don't).
The important point is exactly that I don't know the usage: if I had to contribute to the scheduler, this could blow up in my face (MySQL is no longer protecting me, so I have to be aware of exactly which queries the different parties run).

The change is right; I would just reduce the scope of these "unsafe" regions by starting only selected transactions with the READ COMMITTED isolation level, or by putting exclusive write locks into the current ones (which should likely prevent deadlocks if no other party is allowed to use the resource at all).
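
The narrower scoping suggested here can be expressed with SQLAlchemy's per-connection execution options (a sketch; the DSN and query are placeholders):

```python
# Sketch: scope READ COMMITTED to an individual transaction instead of
# the whole engine, via SQLAlchemy's per-connection execution options.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+mysqldb://user:pass@localhost/airflow")  # example DSN

with engine.connect().execution_options(
    isolation_level="READ COMMITTED"  # applies to this connection only
) as conn:
    with conn.begin():
        conn.execute(text("SELECT 1"))  # placeholder for the "unsafe" query
```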

@alippai commented May 16, 2021

Docs already cover that we are using SKIP LOCKED and it needs a modern DB. SKIP LOCKED is also a great tool for preventing deadlocks with the smallest scope possible. https://github.com/apache/airflow/blob/master/docs/apache-airflow/concepts/scheduler.rst#database-requirements

Disregard my concerns if these are the wrong solutions that wouldn't help your issue; I just hadn't seen a discussion about this yet and felt like we were bringing nukes instead of a scalpel :)
The fact that you took the time to answer in detail already reassures me that this is not something that just "slipped through".
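
For reference, the SKIP LOCKED pattern mentioned above looks roughly like this in SQLAlchemy ORM terms (a sketch; the filter is a simplified stand-in for what the scheduler actually does, and the import assumes an Airflow installation):

```python
# Sketch: claim a disjoint batch with SELECT ... FOR UPDATE SKIP LOCKED
# so concurrent schedulers never block on rows another scheduler holds.
from airflow.models import TaskInstance
from airflow.utils.state import State
from sqlalchemy.orm import Session

def claim_batch(session: Session, limit: int = 100):
    return (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.SCHEDULED)
        .with_for_update(skip_locked=True)  # peers skip rows we hold
        .limit(limit)
        .all()
    )
```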

@potiuk (Member) commented May 16, 2021

I think what also matters is how SQLAlchemy works and how we are using it. I am not really too concerned about changing the default to READ COMMITTED for all transactions, because:

a) we usually have very short transactions, and we usually work on the same set of rows/tables that we retrieve in the first query of the transaction

b) SQLAlchemy will retrieve the rows we are working on and store them as objects in memory, and only when we flush them / commit the transaction will SQLAlchemy merge the changes back

c) the scheduler works (in 2.0) in small "tight" loops. Basically, it retrieves N records (say the first 100 matching the criteria), locks them, and then only that thread of that scheduler performs any changes to those rows and related data, merging them back. Then it commits, goes back, and retrieves the next 100 matching rows. So contention and parallel access to the same rows is not really possible (under normal circumstances).

Also, the scheduler (which is the important one) does indeed use SKIP LOCKED (but I believe locking the same gaps from different schedulers might cause the deadlocks in some scenarios even if SKIP LOCKED is used).

@ashb -> I might not have the whole picture, so maybe you can comment here. BTW, I think it might be one of the interesting topics for your talk at the Summit :) https://twitter.com/AshBerlin/status/1393861492282429443

@saolof commented Jun 17, 2024

Reading this in hindsight, this change is weird.

Postgres uses READ COMMITTED as the default primarily for backwards compatibility; it throws exceptions at higher isolation levels when it detects a write conflict, and its race detector was added more than a decade after the initial release because it required a huge amount of work. The Postgres READ COMMITTED implementation technically provides MAV (it guarantees a consistent view of the database within any single statement), which is stronger than MySQL's READ COMMITTED (which guarantees only that each row is consistent within a statement, so a MySQL SELECT can see some rows from before a multi-row update and some from after it).

MySQL picked REPEATABLE READ as the default because its READ COMMITTED implementation, while technically ANSI-compliant, offers much weaker consistency guarantees than other databases, so you should bump it up one level from what you would use in other DBs.

If you see concurrency issues at REPEATABLE READ, the way to fix them should be to bump the isolation level to SERIALIZABLE in all integration tests against Postgres and fix all of the 40001 exceptions Postgres gives you, even if that requires changing the schema to facilitate it. The actual problem here seems to be that MySQL is bad at propagating error codes from a detected race condition to the user, so Airflow used to continue after a transaction was aborted, while Postgres + SQLAlchemy will give you an explicit exception at higher isolation levels. Fixing the errors raised by Postgres should also fix the concurrency issue that caused MySQL to ignore transactions.
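
A sketch of the approach described here (illustrative, not Airflow code): run against Postgres at SERIALIZABLE and retry transactions that fail with SQLSTATE 40001.

```python
# Sketch: SERIALIZABLE isolation against Postgres, retrying on
# serialization failures (SQLSTATE 40001), as suggested above.
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/airflow",  # example DSN
    isolation_level="SERIALIZABLE",
)

def run_with_retry(work, attempts: int = 5):
    for attempt in range(attempts):
        try:
            with engine.begin() as conn:  # commits on success, rolls back on error
                return work(conn)
        except OperationalError as exc:
            # psycopg2 exposes the SQLSTATE as .pgcode on the wrapped error.
            if getattr(exc.orig, "pgcode", None) == "40001" and attempt < attempts - 1:
                continue  # write conflict detected; retry the whole transaction
            raise

# Example usage:
result = run_with_retry(lambda conn: conn.execute(text("SELECT 1")).scalar())
```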

Labels: area:Scheduler including HA (high availability) scheduler; full tests needed (We need to run full set of tests for this PR to merge)
Successfully merging this pull request may close the following issue: "airflow dag success, but tasks not yet started, not scheduled."
5 participants